Efficient parsing strategies for syntactic analysis of closed captions

Author

Krzysztof Czuba

Abstract

We present an efficient multi-level chart parser designed for syntactic analysis of closed captions (subtitles) in a real-time Machine Translation (MT) system. To achieve high parsing speed, we divided an existing English grammar into multiple levels. The parser proceeds in stages. At each stage, only the rules corresponding to one level are used. A constituent pruning step is added between levels to ensure that constituents unlikely to be part of the final parse are removed. This significantly reduces parse time and ambiguity. Since the domain is unrestricted, out-of-coverage sentences are to be expected, and the parser might not produce a single analysis spanning the whole input. Despite the incomplete parsing strategy and the radical pruning, the initial evaluation results show that the loss of parsing accuracy is acceptable. The parsing time compares favorably with that of a Tomita parser and of a chart parser run on the same grammar and lexicon.

1 Introduction

In this paper we present ongoing research on parsing closed captions (subtitles) from a news broadcast. The research has been conducted as part of an effort to build a prototype of a real-time Machine Translation (MT) system translating news captions from English into Cantonese (Nyberg and Mitamura, 1997). We describe an efficient multi-level chart parser designed to handle the kind of language used in our domain within a time that allows for real-time automatic translation. To achieve high parsing speed, we divided an existing English grammar into multiple levels. The parser proceeds in stages. At each stage, only the rules corresponding to one level are used. A constituent pruning step is added between levels to ensure that constituents unlikely to be part of the final parse are removed. This significantly reduces parse time and ambiguity.
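
The staged strategy described above can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration: the toy grammar, its split into two levels, the function names, and in particular the pruning heuristic (drop any constituent whose span lies strictly inside a constituent built at the current level), which merely stands in for the pruning algorithm of Section 6.

```python
# Illustrative sketch only: the toy grammar, the level split, and the pruning
# heuristic are assumptions, not the grammar or the algorithm of the paper.

def match(rhs, start, by_start):
    """Yield each end position at which the symbol sequence `rhs` is matched
    by a chain of adjacent chart edges beginning at position `start`."""
    if not rhs:
        yield start
        return
    for label, end in by_start.get(start, ()):
        if label == rhs[0]:
            yield from match(rhs[1:], end, by_start)


def apply_level(edges, rules, n):
    """Close the chart under the rules of a single level.
    Edges are (label, start, end); rules are (lhs, rhs_tuple)."""
    edges = set(edges)
    changed = True
    while changed:
        changed = False
        # Index the current edges by start position for this pass.
        by_start = {}
        for label, s, e in edges:
            by_start.setdefault(s, []).append((label, e))
        for lhs, rhs in rules:
            for s in range(n):
                for e in match(rhs, s, by_start):
                    if (lhs, s, e) not in edges:
                        edges.add((lhs, s, e))
                        changed = True
    return edges


def prune(edges, new_edges):
    """Hypothetical pruning step: remove every edge whose span lies strictly
    inside the span of an edge built at the current level."""
    spans = {(s, e) for _, s, e in new_edges}

    def covered(s, e):
        return any(s2 <= s and e <= e2 and (s, e) != (s2, e2)
                   for s2, e2 in spans)

    return {(l, s, e) for l, s, e in edges if not covered(s, e)}


def multilevel_parse(tags, levels):
    """Run the levels in order, pruning the chart after each one."""
    edges = {(tag, i, i + 1) for i, tag in enumerate(tags)}
    for rules in levels:
        closed = apply_level(edges, rules, len(tags))
        edges = prune(closed, closed - edges)
    return edges


# Level 1 builds noun phrases; level 2 builds verb phrases and sentences.
LEVELS = [
    [("NP", ("Det", "N"))],
    [("VP", ("V", "NP")), ("S", ("NP", "VP"))],
]

# The final "Adv" is out of coverage, so no single edge spans the input.
print(sorted(multilevel_parse(["Det", "N", "V", "Det", "N", "Adv"], LEVELS)))
# prints [('Adv', 5, 6), ('S', 0, 5)]
```

In this toy setting, the second level sees only three edges for the first five tokens instead of the original five plus the noun phrases, which is the kind of reduction in chart size that pruning between levels is meant to achieve.
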
Since the domain is unrestricted, out-of-coverage sentences are to be expected, and the parser might not produce a single analysis spanning the whole input. Thus, the set of final constituents has to be extracted from the chart. Despite the incomplete parsing strategy and the radical pruning, the initial evaluation results show that the loss of parsing accuracy is acceptable. The parsing time compares favorably with that of a Tomita parser and of a chart parser run on the same grammar and lexicon.

The outline of the paper is as follows. In Section 2 we describe the syntactic and semantic characteristics of the input domain. Section 3 provides a short summary of previously published research. Section 4 gives an overview of the requirements that our application places on the parsing algorithm. Section 5 describes how the grammar was partitioned into levels. Section 6 describes the constituent pruning algorithm that we used. In Section 7 we present the method for extracting final constituents from the chart. Section 8 presents the results of an initial evaluation. Finally, we close with future research in Section 9.

2 Captioning domain

Translation of closed captions has been attempted before. Popowich et al. (1997) describe a system that translates closed captions taken from North American prime time TV. In their approach, Popowich et al. (1997) assume a shallow parsing method that proves effective in achieving the broad system coverage required for translation of closed captions from, e.g., movies. As reported by Popowich et al. (1997), the shallow analysis performed by the system, combined with the transfer-based translation strategy, results in a number of sentences that are understandable only in conjunction with the broadcast images. The number of sentences that are translated incorrectly is also significant.

The parsing scheme described below was used in a pilot Machine Translation system for translation of news captions.
The following requirements were posed: a) the translations should not be misleading, b) they can be telegraphic, since the input is often in a telegraphic style, c) partial translations are acceptable, d) if no correct translation can be produced, it is preferable not to output any.
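
Because partial translations are acceptable, a parse that does not span the whole input can still be useful once a set of final constituents has been read off the chart. The method of Section 7 is not reproduced in this excerpt, so the sketch below is only a hypothetical stand-in: it assumes chart edges of the form `(label, start, end)` and picks non-overlapping edges that cover as many tokens as possible, preferring fewer and larger constituents on ties. The function name `extract_final` and the selection criterion are both assumptions.

```python
# Hypothetical stand-in for final-constituent extraction; the selection
# criterion (maximum coverage, then fewest constituents) is an assumption.

def extract_final(edges, n):
    """Select non-overlapping edges (label, start, end) from a chart over
    n tokens, maximizing covered tokens, then minimizing the edge count."""
    # Group edges by the position at which they end.
    ends_at = {}
    for edge in edges:
        ends_at.setdefault(edge[2], []).append(edge)

    # best[i] = (tokens covered, -edges used, chosen edges) for tokens 0..i-1
    best = [(0, 0, [])] * (n + 1)
    for i in range(1, n + 1):
        best[i] = best[i - 1]  # option: leave token i-1 uncovered
        for label, s, e in sorted(ends_at.get(i, [])):
            covered, neg_count, chosen = best[s]
            candidate = (covered + (e - s), neg_count - 1,
                         chosen + [(label, s, e)])
            if candidate[:2] > best[i][:2]:
                best[i] = candidate
    return best[n][2]


chart = {("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("Adv", 5, 6)}
print(extract_final(chart, 6))
# prints [('S', 0, 5), ('Adv', 5, 6)]
```

An empty chart yields an empty list, which matches requirement d): when nothing usable has been built, nothing is passed on for translation.
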

Similar resources

An improved joint model: POS tagging and dependency parsing

Dependency parsing is a form of syntactic parsing of natural language that automatically analyzes the dependency structure of sentences and creates a dependency graph for each input sentence. Part-Of-Speech (POS) tagging is a prerequisite for dependency parsing. Generally, dependency parsers perform POS tagging along with dependency parsing in a pipeline mode. Unfortunately, in pipel...

Automatic Closed Caption Detection and Filtering in MPEG Videos for Video Structuring

Video structuring is the process of extracting temporal structural information of video sequences and is a crucial step in video content analysis especially for sports videos. It involves detecting temporal boundaries, identifying meaningful segments of a video and then building a compact representation of video content. Therefore, in this paper, we propose a novel mechanism to automatically pa...

Automatic Semantic Role Labeling in Persian Sentences Using Dependency Trees

Automatically identifying words with semantic roles (such as Agent, Patient, Source, etc.) in sentences and attaching the correct semantic roles to them may lead to improvements in many natural language processing tasks, including information extraction, question answering, text summarization and machine translation. Semantic role labeling systems usually take advantage of syntactic parsing and th...

Transforming trees into hedges and parsing with "hedgebank" grammars

Finite-state chunking and tagging methods are very fast for annotating nonhierarchical syntactic information, and are often applied in applications that do not require full syntactic analyses. Scenarios such as incremental machine translation may benefit from some degree of hierarchical syntactic analysis without requiring fully connected parses. We introduce hedge parsing as an approach to rec...

A Supervised Semantic Parsing with Lexicon Extension and Syntactic Constraint

Existing semantic parsing research has steadily improved accuracy on a few domains and their corresponding meaning representations. In this paper, we present a novel supervised semantic parsing algorithm, which includes the lexicon extension and the syntactic supervision. This algorithm adopts a large-scale knowledge base from the open-domain Freebase to construct efficient, rich Combinatory Ca...

Publication date: 2000
